Instructions

1 [WALKTHROUGH]: Marginalizing out latent discrete parameters in Stan

Stan, like many other state-of-the-art Bayesian samplers, uses HMC/NUTS, which requires exploring a smooth posterior surface. Out of the box, Stan is therefore unable to sample latent discrete parameters. A solution that is often (but not always) feasible is to marginalize out these latent discrete parameters. In the following, we look at an example where inferring latent discrete parameters is an attractive modeling approach. We then show (in math) how to “marginalize out” these discrete parameters and finally implement this in Stan.

The ‘exam scores’ example

The example we look at here is the ‘exam scores’ case from Lee & Wagenmakers (2015, §6.1), which is also provided as part of the optional reading material (just 2 pages!) on Courseware. The idea is this. \(p=15\) people answered \(n=40\) exam questions. Exam questions were 2-option forced-choice questions. Every participant answered every question. The vector \(k\) below contains how many out of the 40 questions each person answered correctly:

Here’s a plot of the ‘exam score’ data. The \(x\)-axis has the participants (ordered by the sequence of occurrence in the data vector). The \(y\)-axis shows the proportion of correct answers.

Suppose that we had reason to suspect that some people were prepared for the exam, some others were just guessing randomly. We’d like to use the data to infer:

  1. how likely it is that participant \(i\) was just guessing (with chance of correctness \(0.5\)); and
  2. what the average rate of correctly answering a question is among those who did not just guess randomly.

Notice that the first question requires reasoning about the state of a binary (hence: discrete) latent variable: whether person \(i\) is a guesser or not.

Looking at the data, intuitively, we’d probably want to say something like: the first five people were most likely guessers; the others were most likely not guessing but gave informed answers. Let’s see if we can write a Bayesian model that formalizes and tests this intuition.

The model used by Lee & Wagenmakers is shown below:

The model features a latent discrete variable \(z_i\) for each participant \(i\). Depending on the value of \(z_i\), the probability of answering correctly is either \(0.5\) for guessers, or an unknown value \(\phi\) which is to be estimated from the data.

The problem is that HMC/NUTS cannot deal with latent discrete parameters, since it wants to ride its sleigh on smooth surfaces, so to speak. That’s why we do some math, to present the sampler with a smooth surface to sleigh on after all. We simply “remove” the latent discrete variables. More concretely, we compute their accumulated impact and pass that calculation back to the sampler, so that no information is lost. Here’s how we do this.

Marginalization: setting the scene

Let’s start at the very beginning of Bayesian inference. We want to compute the posterior, which in non-normalized form we can write as:

\[ P(\theta \mid D) \propto P(\theta) \ P(D \mid \theta) \]

Our sampler indeed operates by looking at what we may call the non-normalized score \(P(\theta) \ P(D \mid \theta)\), which is on the right-hand side above, but it actually considers (for numerical precision) the log of that. So, at each point during sampling, the sampler is interested in this log-score \(\log \left (P(\theta) \ P(D \mid \theta) \right)\) for different instantiations of the parameter vector \(\theta\). It uses this log-score to calculate the shape of the surface to sleigh on, so to speak.

Notice that in general:

\[ \log P(\theta, D ) = \log \left(P(\theta) P(D \mid \theta) \right) = \log P(\theta) + \log P(D \mid \theta) \]

Our goal is to write down the log-score in such a way that all discrete parameters are removed. If \(\theta\) consists of \(\theta'\), a vector of continuous parameters, and \(\eta\), discrete parameters whose joint value ranges over the finitely many configurations \(\eta_1, \dots, \eta_n\), the marginalization is this:

\[ \log P(\theta', D) = \log \sum_{i=1}^{n} P(\theta', \eta_i, D) \]

Marginalization for the ‘exam scores’ case

Now specifically, for our concrete ‘exam scores’ example, the log-score is a big sum over the parts contributed by each participant \(i\):

\[ \log P(D, \theta) = \sum_i \log P(k_i, n, z_i, \phi) \]

The contribution of each single participant \(i\) is:

\[\log P(k_i, n, z_i, \phi) = \log \text{Bern}(z_i \mid 0.5) + \log \text{Binom}(k_i, n, \text{ifelse}(z_i=1, \phi, 0.5))\]

Our goal is to get rid of the variable \(z_i\) in this latter representation. Since each \(z_i\) can only ever take on one of finitely many values (in fact: only one of two), we can marginalize out the variables \(z_i\) with a sum, so to speak, iterating over all possible values of \(z_i\). We do this first for the score, not the log-score, because that involves less clutter:

\[ \begin{align*} P(k_i, n, \phi) & = \sum_{z_i \in \{0,1\}} \text{Bern}(z_i \mid 0.5) \ \text{Binom}(k_i, n, \text{ifelse}(z_i=1, \phi, 0.5)) \\ & = \text{Bern}(0 \mid 0.5) \ \text{Binom}(k_i, n, 0.5) + \text{Bern}(1 \mid 0.5) \ \text{Binom}(k_i, n, \phi) \\ & = \underbrace{0.5 \ \text{Binom}(k_i, n, 0.5)}_{\exp \alpha_0} + \underbrace{0.5 \ \text{Binom}(k_i, n, \phi)}_{\exp \alpha_1} \end{align*} \]

Notice that in Stan, we would represent a summand like

\[ 0.5 \ \text{Binom}(k_i, n, 0.5) \]

also as a log. (Think: everything defaults to \(\log\) for numerical precision; we have functions for log-probability, not probability density/mass, for instance.) This is why we have introduced the variables \(\alpha_0\) and \(\alpha_1\) as the logs of the summands we ended up with above. To be clear:

\[ \begin{align*} \alpha_0 &= \log 0.5 + \log \text{Binom}(k_i, n, 0.5) \\ \alpha_1 &= \log 0.5 + \log \text{Binom}(k_i, n, \phi) \end{align*} \]

With this, we can write the log-score contribution of each participant \(i\), with \(z_i\) marginalized out, as:

\[ \log P(k_i, n, \phi) = \log \sum_{j=0}^1 \exp \alpha_j \]

Stan has a useful built-in function log_sum_exp which computes exactly this last quantity for a given vector, like our \(\alpha\) here. Stan’s log_sum_exp is particularly useful because it does clever tricks behind the scenes to ensure maximal numerical precision.
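To see both the marginalization identity and log_sum_exp’s stability trick at work outside Stan, here is a small illustrative Python sketch; the values \(k_i = 21\), \(n = 40\), \(\phi = 0.85\) are made up for the example.

```python
import math

def binom_logpmf(k, n, theta):
    # log Binom(k | n, theta), via log-gamma to avoid overflowing factorials
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(theta) + (n - k) * math.log(1 - theta))

def log_sum_exp(alpha):
    # stable log(sum(exp(alpha))): shift by the maximum before exponentiating,
    # so one exponentiated term is exactly 1 and none of them can overflow
    m = max(alpha)
    return m + math.log(sum(math.exp(a - m) for a in alpha))

# made-up example values for one participant
k_i, n, phi = 21, 40, 0.85
alpha0 = math.log(0.5) + binom_logpmf(k_i, n, 0.5)   # guessing summand
alpha1 = math.log(0.5) + binom_logpmf(k_i, n, phi)   # informed summand
marginal_log_score = log_sum_exp([alpha0, alpha1])
```

For moderately sized summands the naive \(\log(\exp \alpha_0 + \exp \alpha_1)\) gives the same answer, but for very negative summands (say, around \(-1000\)) the naive version underflows to \(\log 0\) while the max-shifted version remains exact.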

But that’s it! We have expressed the (log-)score of the model without direct reference to a latent discrete variable. We did this by “enumerating and folding” all the possible cases, i.e., all the possible values the discrete variables can take on.

Implementation in Stan

Now we can cast this into a Stan program. The code is shown below and included in file ADA-W09-Ex1a-exam-scores.stan. Notice that we only have a single latent parameter left, namely the (continuous) \(\phi\). In the transformed parameters block we compute the vector \(\alpha\) carrying the log-summands of the marginalization, just as we expressed mathematically above. (Small indexing note: since Stan arrays are 1-indexed, alpha[i,1] in the code holds the \(\phi\)-summand \(\alpha_1\) and alpha[i,2] the guessing summand \(\alpha_0\); the order is immaterial for the sum.) To obtain information about the marginalized-out parameters in \(z\), we sample them in the generated quantities block. This latter sampling is independent of the computation of the posterior (it happens after each next sample has been determined), but it does, of course, depend on the posterior, so these are indeed samples of \(z_i\) from the posterior!

Notice also that the code includes the line target += log_sum_exp(alpha[i]);, which is syntax we have not seen so far. Here, the variable target is incremented by the log-score contribution with \(z\) marginalized out (as computed mathematically above). The variable target is always available in Stan; it contains the log-score as accumulated so far during the execution of the program.

data { 
  int<lower=1> p;  
  int<lower=0> k[p];   
  int<lower=1> n;
}
parameters {
  real<lower=.5,upper=1> phi; 
} 
transformed parameters {
  vector[2] alpha[p];
  for (i in 1:p) {
    alpha[i,1] = log(.5) + binomial_lpmf(k[i] | n, phi);
    alpha[i,2] = log(.5) + binomial_lpmf(k[i] | n, 0.5); 
  }
}
model {
  for (i in 1:p)
    target += log_sum_exp(alpha[i]);
}
generated quantities {
  int<lower=0,upper=1> z[p];
  for (i in 1:p) {
    z[i] = bernoulli_rng(softmax(alpha[i])[1]);
  }
}
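As a cross-check of the generated quantities logic: softmax(alpha[i])[1] is just the posterior probability, given \(\phi\), that participant \(i\) gave informed answers. A Python sketch of that computation (the values \(k_i = 35\), \(n = 40\), \(\phi = 0.85\) are hypothetical):

```python
import math

def binom_logpmf(k, n, theta):
    # log Binom(k | n, theta), computed via log-gamma for numerical stability
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(theta) + (n - k) * math.log(1 - theta))

def softmax(alpha):
    # exponentiate-and-normalize, shifted by the max for numerical stability
    m = max(alpha)
    exps = [math.exp(a - m) for a in alpha]
    s = sum(exps)
    return [e / s for e in exps]

# hypothetical values: a participant with 35/40 correct, phi = 0.85
k_i, n, phi = 35, 40, 0.85
alpha = [math.log(0.5) + binom_logpmf(k_i, n, phi),   # like alpha[i,1]: informed
         math.log(0.5) + binom_logpmf(k_i, n, 0.5)]   # like alpha[i,2]: guessing
prob_informed = softmax(alpha)[0]   # mirrors softmax(alpha[i])[1] (Stan is 1-indexed)
```

For a high scorer like this, prob_informed is very close to 1; for someone near 20/40 it drops to almost 0, which is exactly the pattern in the posterior means of \(z\) reported below.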

We can run this model as usual:

The most interesting piece of information is the answer to our question of how likely each participant \(i\) is to be a genuine answerer or just a guesser. Here are the means of inferred values of each \(z_i\):

##  [1] 0.000 0.000 0.000 0.000 0.000 0.994 0.996 1.000 1.000 1.000 1.000 1.000
## [13] 1.000 1.000 1.000

Indeed, given the data and model, we should believe that the first five people were mere guessers. Also, posterior uncertainty about this is quite low.

2 [HOMEWORK]: Finite Mixtures

Let’s look again (yawn) at our fictitious flower data. We take our measures from before but add 4 to each measure from group B (for educational reasons, as you’ll see presently).

Here’s how this data is distributed:

Now suppose that we get the data like this, without information which measure was from which group:

Data may often look like this, showing signs of bi- or multi-modality, i.e., having several “humps” or apparent local areas of higher concentration. If we fit a single Gaussian to this data it might look like this:

We are therefore trying to estimate a Gaussian mixture model (GMM). We take the simplest GMM with just two groups (because we see two “humps”, or have a priori information that there are exactly two groups; the bonus exercise looks at a generalization to \(K>2\) groups). Concretely, for each data point \(y_i\), \(i \in \{1, \dots, N\}\), we are going to estimate how likely data point \(i\) may have been a sample from normal distribution “Number 0”, with \(\mu_0\) and \(\sigma_0\), or from normal distribution “Number 1”, with \(\mu_1\) and \(\sigma_1\). Naturally, all \(\mu_{0,1}\) and \(\sigma_{0,1}\) are estimated from the data, as are the group-indicator variables \(z_i\). There is also a global parameter \(p\) which indicates how likely any data point is to come from one of the two distributions (you’ll think about which one below!). Here’s the full model we will work with (modulo an additional ordering constraint, as discussed below):

\[ \begin{align*} p & \sim \text{Beta}(1,1) \\ z_i & \sim \text{Bernoulli}(p) \\ \mu_{0,1} & \sim \mathcal{N}(12, 10) \\ \sigma_{0,1} & \sim \text{log-normal}(0, 1) \\ y_i & \sim \mathcal{N}(\mu_{z_i}, \sigma_{z_i}) \end{align*} \]

2.a Draw the model

Draw a graphical representation of this mixture model, following the conventions outlined here. You can draw on paper, take a picture, or draw by hand with a mouse in any drawing program (like this). Maybe use ASCII art. Anything is fine really! This does not need to look pretty. It just needs to be correct.

2.b Run the model & interpret the output

We are going to pack the data together for fitting the Stan model:

Below is the Stan code for this model. It is also given in file ADA-W09-Ex2b-GMM.stan. A few comments on this code:

  1. There is no occurrence of variable \(z_i\), as this is marginalized out. We do this following the same recipe as before and increment the log-score manually, using target += log_sum_exp(alpha).
  2. We declare vector mu to be of a particular type which we have not seen before. We want the vector to be ordered. We will come back to this later. Don’t worry about it now.
data {
  int<lower=1> N; 
  real y[N];      
}
parameters {
  real<lower=0,upper=1> p;         
  ordered[2] mu;             
  vector<lower=0>[2] sigma; 
}
model {
  p ~ beta(1,1);
  mu ~ normal(12, 10);
  sigma ~ lognormal(0, 1);
  for (i in 1:N) {
    vector[2] alpha;
    alpha[1] = log(p)   + normal_lpdf(y[i] | mu[1], sigma[1]);
    alpha[2] = log(1-p) + normal_lpdf(y[i] | mu[2], sigma[2]);
    target += log_sum_exp(alpha);
  }
}
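Each iteration of the loop adds \(\log\left(p\,\mathcal{N}(y_i \mid \mu_1, \sigma_1) + (1-p)\,\mathcal{N}(y_i \mid \mu_2, \sigma_2)\right)\) to the target. As a sketch of that arithmetic in Python (the parameter values here are made up, roughly in the range of the fitted posterior means):

```python
import math

def normal_lpdf(y, mu, sigma):
    # log density of N(y | mu, sigma)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)

def log_sum_exp(alpha):
    # numerically stable log(sum(exp(alpha)))
    m = max(alpha)
    return m + math.log(sum(math.exp(a - m) for a in alpha))

def mixture_loglik_term(y_i, p, mu, sigma):
    # one loop iteration of the Stan model:
    # log( p * N(y | mu[1], sigma[1]) + (1-p) * N(y | mu[2], sigma[2]) )
    alpha = [math.log(p) + normal_lpdf(y_i, mu[0], sigma[0]),
             math.log(1 - p) + normal_lpdf(y_i, mu[1], sigma[1])]
    return log_sum_exp(alpha)

# hypothetical observation and parameter draw
term = mixture_loglik_term(11.0, 0.5, [10.4, 15.7], [2.0, 1.4])
```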
## Inference for Stan model: ADA-W09-Ex2b-GMM.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##             mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff
## p           0.54    0.00 0.13    0.30    0.45    0.54    0.62    0.80  1130
## mu[1]      10.37    0.03 0.80    9.03    9.80   10.29   10.87   12.14   906
## mu[2]      15.70    0.02 0.60   14.30   15.41   15.78   16.11   16.66  1139
## sigma[1]    2.05    0.02 0.56    1.17    1.65    1.98    2.37    3.32  1065
## sigma[2]    1.42    0.01 0.46    0.76    1.11    1.35    1.65    2.57  1063
## lp__     -125.28    0.06 1.85 -129.99 -126.18 -124.87 -123.94 -122.95   871
##          Rhat
## p        1.00
## mu[1]    1.00
## mu[2]    1.01
## sigma[1] 1.00
## sigma[2] 1.00
## lp__     1.00
## 
## Samples were drawn using NUTS(diag_e) at Thu Jun 18 07:12:51 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Interpret this outcome! Focus on parameters \(p\), \(\mu_1\) and \(\mu_2\). What does \(p\) capture in this implementation? Do the (mean) estimated values make sense?

Solution:

The posterior mean of parameter \(p\) captures our posterior belief about the probability that any given data point belongs to the first mixture component, i.e., that its \(\mu\) and \(\sigma\) come from normal distribution “number 1” (the component paired with \(\log(p)\) in the code). Its sampled mean is \(p \approx 0.54\). Yes, the mean estimates for both mu parameters make sense: they pick out the two heights around which the flower data cluster, plausibly also in light of our prior with mean 12 and standard deviation 10 (and visually comparable with our plots above).

2.c An unidentifiable model

Let’s run the model in file ADA-W09-Ex2c-GMM.stan, which is exactly the same as before but with vector mu being an unordered vector of reals.

Here’s a summary of the outcome:

## Inference for Stan model: ADA-W09-Ex2c-GMM.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##             mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff
## p           0.51    0.01 0.14    0.21    0.42    0.51    0.61    0.79    99
## mu[1]      12.61    1.55 2.69    9.14   10.20   11.39   15.60   16.44     3
## mu[2]      13.47    1.59 2.74    9.23   10.46   15.12   15.90   16.55     3
## sigma[1]    1.83    0.20 0.64    0.82    1.36    1.74    2.23    3.20    10
## sigma[2]    1.66    0.19 0.60    0.79    1.22    1.56    2.01    3.09    10
## lp__     -127.07    0.07 1.94 -132.01 -127.99 -126.64 -125.68 -124.67   859
##          Rhat
## p        1.04
## mu[1]    1.97
## mu[2]    2.03
## sigma[1] 1.15
## sigma[2] 1.15
## lp__     1.00
## 
## Samples were drawn using NUTS(diag_e) at Thu Jun 18 07:14:07 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Tell us what is remarkable here and explain why this happened. Explain in what sense this model is “unidentifiable”.

Hint: Explore the parameters with high \(\hat{R}\) values. When a model fit seems problematic, a nice tool to explore what might be amiss is the package shinystan. You could do this:

Then head over to the tab “Explore” and have a look at some of the parameters.

Solution:

Since we did not use an ordered vector for the mu parameters, the model is unidentifiable: swapping the two components (exchanging mu[1] with mu[2], sigma[1] with sigma[2], and p with 1-p) yields exactly the same likelihood, so the posterior has two symmetric modes. Accordingly, the chains for the mu parameters do not converge (note the \(\hat{R}\) values around 2 and the very low effective sample sizes): different chains settle on different labelings, and the marginal densities for mu[1] and mu[2] are bimodal. The posterior summaries thus average over both labelings and are not representative of the grouping in our data.
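The label-switching symmetry can be checked directly: swapping the two components (and replacing \(p\) by \(1-p\)) leaves the marginal likelihood untouched. A Python sketch with made-up data and parameter values:

```python
import math

def normal_lpdf(y, mu, sigma):
    # log density of N(y | mu, sigma)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)

def log_sum_exp(alpha):
    # numerically stable log(sum(exp(alpha)))
    m = max(alpha)
    return m + math.log(sum(math.exp(a - m) for a in alpha))

def gmm_loglik(ys, p, mu, sigma):
    # total marginal log-likelihood of a two-component GMM (no ordering constraint)
    total = 0.0
    for y in ys:
        total += log_sum_exp([math.log(p) + normal_lpdf(y, mu[0], sigma[0]),
                              math.log(1 - p) + normal_lpdf(y, mu[1], sigma[1])])
    return total

ys = [9.1, 10.5, 11.2, 15.0, 16.3]                       # made-up observations
ll_a = gmm_loglik(ys, 0.6, [10.0, 15.5], [2.0, 1.5])
ll_b = gmm_loglik(ys, 0.4, [15.5, 10.0], [1.5, 2.0])     # components swapped
```

Since ll_a equals ll_b exactly, no amount of data can distinguish the two labelings; the ordered constraint on mu breaks this symmetry.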

2.d Posterior predictive check

Extend the model from 2b to also output samples from the posterior predictive distribution. Save your code in a file ADA-W09-Ex2d-GMM.stan. Run the model, collect the posterior predictive samples in a variable called yrep and draw a density plot. Does this look like a distribution that could have generated the data? You can use the code below once the model is coded up.

data {
  int<lower=1> N; 
  real y[N];      
}
parameters {
  real<lower=0,upper=1> p;         
  ordered[2] mu;             
  vector<lower=0>[2] sigma; 
}
model {
  p ~ beta(1,1);
  mu ~ normal(12, 10);
  sigma ~ lognormal(0, 1);
  for (i in 1:N) {
    vector[2] alpha;
    alpha[1] = log(p)   + normal_lpdf(y[i] | mu[1], sigma[1]);
    alpha[2] = log(1-p) + normal_lpdf(y[i] | mu[2], sigma[2]);
    target += log_sum_exp(alpha);
  }
}
generated quantities {
  real yrep[N];
  // note: this draws exactly half of the replicates from each component,
  // rather than drawing each replicate's component with probability p;
  // a simplification that is reasonable here since p is estimated near 0.5
  for(i in 1:N/2) {
    yrep[i] = normal_rng(mu[1], sigma[1]);
  }
  for(i in (N/2)+1:N) {
    yrep[i] = normal_rng(mu[2], sigma[2]);
  }
}
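The generated quantities block above fixes a half/half split of the replicates across the two components. A draw from the posterior predictive distribution proper would instead pick each replicate’s component with probability \(p\). Here is an illustrative Python sketch of that alternative (parameter values are made up, close to the reported posterior means):

```python
import random

def sample_yrep(p, mu, sigma, N, rng=None):
    # draw each replicate's component indicator with probability p for
    # component 1 (matching the log(p) / mu[1] pairing in the Stan model),
    # then sample from that component's normal distribution
    rng = rng or random.Random(2020)
    yrep = []
    for _ in range(N):
        c = 0 if rng.random() < p else 1
        yrep.append(rng.gauss(mu[c], sigma[c]))
    return yrep

# hypothetical posterior draw, close to the posterior means reported above
yrep = sample_yrep(0.54, [10.4, 15.7], [2.0, 1.4], 50)
```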
## Inference for Stan model: ADA-W09-Ex2d-GMM.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##             mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff
## yrep[1]    10.42    0.04 2.34    5.95    8.93   10.28   11.80   15.43  3054
## yrep[2]    10.34    0.05 2.26    6.08    8.88   10.23   11.68   15.24  2381
## yrep[3]    10.31    0.04 2.31    5.86    8.79   10.25   11.70   15.18  2880
## yrep[4]    10.41    0.04 2.35    6.34    8.88   10.21   11.75   15.57  3026
## yrep[5]    10.40    0.04 2.27    6.16    8.95   10.28   11.78   15.23  2903
## yrep[6]    10.33    0.04 2.35    6.12    8.80   10.20   11.64   15.46  3027
## yrep[7]    10.38    0.04 2.27    6.24    8.90   10.27   11.68   15.50  2744
## yrep[8]    10.36    0.04 2.29    6.15    8.87   10.20   11.71   15.47  3119
## yrep[9]    10.36    0.04 2.29    6.25    8.84   10.12   11.70   15.45  2678
## yrep[10]   10.40    0.05 2.33    6.17    8.89   10.27   11.77   15.44  2571
## yrep[11]   10.42    0.04 2.30    6.22    8.87   10.32   11.74   15.37  2973
## yrep[12]   10.37    0.04 2.28    6.16    8.87   10.26   11.74   15.31  2627
## yrep[13]   10.37    0.04 2.29    6.26    8.84   10.20   11.71   15.39  2635
## yrep[14]   10.38    0.04 2.27    6.10    8.88   10.24   11.77   15.18  2980
## yrep[15]   10.35    0.04 2.29    6.26    8.83   10.19   11.68   15.24  3182
## yrep[16]   10.44    0.04 2.27    6.44    8.98   10.26   11.80   15.29  2589
## yrep[17]   10.33    0.05 2.34    6.02    8.85   10.19   11.67   15.48  2322
## yrep[18]   10.37    0.04 2.27    6.27    8.91   10.23   11.69   15.42  3181
## yrep[19]   10.37    0.04 2.26    6.19    8.87   10.22   11.73   15.43  2935
## yrep[20]   10.47    0.04 2.29    6.27    8.94   10.29   11.85   15.59  3015
## yrep[21]   10.39    0.05 2.32    6.17    8.88   10.20   11.68   15.60  2540
## yrep[22]   10.35    0.04 2.27    6.21    8.88   10.19   11.72   15.11  2634
## yrep[23]   10.37    0.04 2.33    6.14    8.87   10.25   11.73   15.48  2922
## yrep[24]   10.37    0.04 2.30    6.22    8.84   10.25   11.72   15.39  2709
## yrep[25]   10.40    0.04 2.28    6.30    8.87   10.25   11.74   15.32  2818
## yrep[26]   15.69    0.03 1.62   12.26   14.80   15.77   16.72   18.62  3181
## yrep[27]   15.70    0.03 1.62   12.18   14.81   15.81   16.70   18.66  3078
## yrep[28]   15.70    0.03 1.60   12.07   14.79   15.82   16.72   18.62  3451
## yrep[29]   15.68    0.03 1.64   11.90   14.83   15.78   16.71   18.63  3576
## yrep[30]   15.70    0.03 1.64   11.95   14.80   15.83   16.73   18.56  3709
## yrep[31]   15.73    0.03 1.62   12.30   14.79   15.82   16.76   18.73  3073
## yrep[32]   15.70    0.03 1.65   12.07   14.84   15.85   16.74   18.58  3259
## yrep[33]   15.62    0.03 1.64   12.00   14.73   15.78   16.69   18.47  2666
## yrep[34]   15.71    0.03 1.59   12.22   14.81   15.83   16.73   18.68  3312
## yrep[35]   15.72    0.03 1.63   12.20   14.84   15.82   16.73   18.61  3535
## yrep[36]   15.73    0.03 1.62   12.30   14.85   15.81   16.75   18.71  2903
## yrep[37]   15.72    0.03 1.64   12.11   14.81   15.82   16.75   18.69  3245
## yrep[38]   15.68    0.03 1.63   12.06   14.80   15.82   16.70   18.63  3168
## yrep[39]   15.69    0.03 1.62   12.18   14.78   15.82   16.73   18.61  2811
## yrep[40]   15.70    0.03 1.63   12.15   14.78   15.80   16.75   18.71  3032
## yrep[41]   15.69    0.03 1.61   12.19   14.78   15.78   16.71   18.66  2836
## yrep[42]   15.70    0.03 1.64   12.00   14.79   15.85   16.72   18.69  3234
## yrep[43]   15.74    0.03 1.60   12.09   14.81   15.85   16.80   18.58  2891
## yrep[44]   15.69    0.03 1.59   12.21   14.79   15.80   16.70   18.55  2957
## yrep[45]   15.70    0.03 1.63   12.15   14.78   15.81   16.73   18.65  3308
## yrep[46]   15.70    0.03 1.63   12.10   14.79   15.84   16.72   18.69  3494
## yrep[47]   15.69    0.03 1.63   12.04   14.80   15.79   16.74   18.50  3315
## yrep[48]   15.73    0.03 1.62   11.98   14.85   15.86   16.75   18.64  3214
## yrep[49]   15.70    0.03 1.59   12.22   14.77   15.79   16.71   18.61  3151
## yrep[50]   15.70    0.03 1.61   12.08   14.79   15.84   16.75   18.57  3710
## lp__     -125.40    0.07 1.94 -130.30 -126.34 -124.94 -123.99 -122.98   723
##          Rhat
## yrep[1]  1.00
## yrep[2]  1.00
## yrep[3]  1.00
## yrep[4]  1.00
## yrep[5]  1.00
## yrep[6]  1.00
## yrep[7]  1.00
## yrep[8]  1.00
## yrep[9]  1.00
## yrep[10] 1.00
## yrep[11] 1.00
## yrep[12] 1.00
## yrep[13] 1.00
## yrep[14] 1.00
## yrep[15] 1.00
## yrep[16] 1.00
## yrep[17] 1.00
## yrep[18] 1.00
## yrep[19] 1.00
## yrep[20] 1.00
## yrep[21] 1.00
## yrep[22] 1.00
## yrep[23] 1.00
## yrep[24] 1.00
## yrep[25] 1.00
## yrep[26] 1.00
## yrep[27] 1.00
## yrep[28] 1.00
## yrep[29] 1.00
## yrep[30] 1.00
## yrep[31] 1.00
## yrep[32] 1.00
## yrep[33] 1.00
## yrep[34] 1.00
## yrep[35] 1.00
## yrep[36] 1.00
## yrep[37] 1.00
## yrep[38] 1.00
## yrep[39] 1.00
## yrep[40] 1.00
## yrep[41] 1.00
## yrep[42] 1.00
## yrep[43] 1.00
## yrep[44] 1.00
## yrep[45] 1.00
## yrep[46] 1.00
## yrep[47] 1.00
## yrep[48] 1.00
## yrep[49] 1.00
## yrep[50] 1.00
## lp__     1.01
## 
## Samples were drawn using NUTS(diag_e) at Thu Jun 18 07:15:42 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Solution:

The model fit suggests convergence for the posterior predictive draws, and, visually, the density of the replicated data matches the observed bimodal distribution well; the Gaussian mixture model thus seems a plausible generative account of the data.

2.e Using BRMS

We can also run this finite mixture model in brms. We saw earlier that fitting the parameters of a single Gaussian is like fitting an intercept-only simple linear regression model. We can add finite mixtures in brms like so (the syntax for creating mixtures is not so important for us right now):

Let’s look at the model fit:

##  Family: mixture(gaussian, gaussian) 
##   Links: mu1 = identity; sigma1 = identity; mu2 = identity; sigma2 = identity; theta1 = identity; theta2 = identity 
## Formula: y ~ 1 
##    Data: data_GMM (Number of observations: 50) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## mu1_Intercept    10.53      1.16     8.74    12.77 1.01      510      320
## mu2_Intercept    15.58      1.22    12.92    16.82 1.02      462      247
## 
## Family Specific Parameters: 
##        Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma1     2.52      2.64     1.17     4.01 1.02      396      441
## sigma2     1.75      1.82     0.69     3.74 1.02      337      242
## theta1     0.54      0.18     0.07     0.88 1.02      524      315
## theta2     0.46      0.18     0.12     0.93 1.02      524      315
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Let’s also look at the Stan code that brms produced in the background for this model in order to find out how this model is related to that of Ex 2.b:

## // generated with brms 2.12.0
## functions {
## }
## data {
##   int<lower=1> N;  // number of observations
##   vector[N] Y;  // response variable
##   vector[2] con_theta;  // prior concentration
##   int prior_only;  // should the likelihood be ignored?
## }
## transformed data {
## }
## parameters {
##   real<lower=0> sigma1;  // residual SD
##   real<lower=0> sigma2;  // residual SD
##   simplex[2] theta;  // mixing proportions
##   ordered[2] ordered_Intercept;  // to identify mixtures
## }
## transformed parameters {
##   // identify mixtures via ordering of the intercepts
##   real Intercept_mu1 = ordered_Intercept[1];
##   // identify mixtures via ordering of the intercepts
##   real Intercept_mu2 = ordered_Intercept[2];
##   // mixing proportions
##   real<lower=0,upper=1> theta1;
##   real<lower=0,upper=1> theta2;
##   theta1 = theta[1];
##   theta2 = theta[2];
## }
## model {
##   // initialize linear predictor term
##   vector[N] mu1 = Intercept_mu1 + rep_vector(0, N);
##   // initialize linear predictor term
##   vector[N] mu2 = Intercept_mu2 + rep_vector(0, N);
##   // priors including all constants
##   target += normal_lpdf(Intercept_mu1 | 12, 10);
##   target += student_t_lpdf(sigma1 | 3, 0, 10);
##   target += normal_lpdf(Intercept_mu2 | 12, 10);
##   target += student_t_lpdf(sigma2 | 3, 0, 10);
##   target += dirichlet_lpdf(theta | con_theta);
##   // likelihood including all constants
##   if (!prior_only) {
##     // likelihood of the mixture model
##     for (n in 1:N) {
##       real ps[2];
##       ps[1] = log(theta1) + normal_lpdf(Y[n] | mu1[n], sigma1);
##       ps[2] = log(theta2) + normal_lpdf(Y[n] | mu2[n], sigma2);
##       target += log_sum_exp(ps);
##     }
##   }
## }
## generated quantities {
##   // actual population-level intercept
##   real b_mu1_Intercept = Intercept_mu1;
##   // actual population-level intercept
##   real b_mu2_Intercept = Intercept_mu2;
## }

Now, your job. Look at the two previous outputs and answer the following questions:

  • Is the brms-model the exact same as the model in Ex 2.b?
  • What is the equivalent of the variable alpha from the model of Ex 2.b in this new brms-generated code?
  • What is the equivalent of the variable p from the model of Ex 2.b in this new brms-generated code?
  • Is the brms code generating posterior predictive samples?
  • What is the prior probability in the brms-generated model of any given data point \(y_i\) to be from the first or second mixture component? Can you even tell from the code?

Solution:

  • No, the brms model is not exactly the same: for instance, it uses different prior distributions (e.g., Student-t priors on the sigma parameters instead of our log-normal priors).
  • The equivalent of the variable alpha from Ex 2.b is the variable ps in the brms-generated likelihood loop.
  • The equivalent of the variable p from Ex 2.b is the variable theta1.
  • No, the brms code does not generate posterior predictive samples: its generated quantities block only exposes the intercepts and contains no _rng draws of replicated data.
  • Parameter theta is given a Dirichlet prior whose concentration con_theta is passed in as data, so the prior probability of component membership cannot be read off the Stan code alone. theta is the simplex comprising theta1 and theta2, and theta1 is the prior probability of any given data point coming from the first mixture component.

3 [HOMEWORK]: Divergent transitions, and how to tame (at least some of) them

The “eight schools” example is a classic and simple illustration of a hierarchical model. There are \(N = 8\) pairs of observations, each pair from a different school. For each school \(i\), we have an estimated effect size \(y_i\) and an estimated standard error \(\sigma_i\) for the reported effect size. (The experiments conducted at each school, which gave us these pairs, investigated whether short-term coaching has an effect on SAT scores.)

We are interested in inferring the latent true effect size \(\theta_i\) for each school \(i\) that could have generated the observed effect size \(y_i\) given spread \(\sigma_i\).

We could assume that each school’s true effect size \(\theta_i\) is entirely independent of any other. In contrast, we could assume that there is a single true effect size for all schools \(\theta_i = \theta_j\) for all \(i\) and \(j\). Or, more reasonably, we let the data decide and consider a model that tries to estimate how likely it is that \(\theta_i\) and \(\theta_j\) for different schools \(i\) and \(j\) are similar or not.

To do so, we assume a hierarchical model. The true effect sizes \(\theta_i\) and \(\theta_j\) of schools \(i\) and \(j\) are assumed:

  1. to have played a role in (stochastically) generating the observed \(y_i\) and \(y_j\), and
  2. to be themselves (stochastically) generated by (a hierarchical) process that generates (and thereby possibly assimilates) the values of \(\theta_i\) and \(\theta_j\).

Concretely, the model takes the following form:

\[ \begin{align*} y_i & \sim \mathcal{N}(\theta_i, \sigma_i) \\ \theta_i & \sim \mathcal{N}(\mu, \sigma') \\ \mu & \sim \mathcal{N}(0, 10) \\ \sigma' & \sim \text{half-Cauchy}(0, 10) \\ \end{align*} \]

3.a Draw the model

Draw a graphical representation of this hierarchical model, following the conventions outlined here. Again, any format which we can decipher easily is fine, as long as it is practical (and fun) for you.

3.b Run the model, inspect and explain the divergent transitions

The Stan code for this model is shown below and also included in file ADA-W09-Ex3a-8schools-centered.stan.

data {
  int<lower=0> N;
  vector[N] y;
  vector<lower=0>[N] sigma;
}
parameters {
  real mu;
  real<lower=0> sigma_prime;
  vector[N] theta;
}
model {
  mu ~ normal(0, 10);
  sigma_prime ~ cauchy(0, 10);
  theta ~ normal(mu, sigma_prime);
  y ~ normal(theta, sigma);
}

Normally, there are a lot of divergent transitions when you run this code:

## [1] 86

Let’s go explore these divergent transitions using shinystan. Execute the command below, go to the tab “Explore” in the Shiny App, select “Bivariate” and explore plots of \(\sigma'\) against \(\theta_i\) for different \(i\). Points that experienced divergent transitions are shown in red.

You can also produce your own (funnel) plots with the code shown below, which may be even clearer because it uses a log-transform. Again, points with divergencies are shown in red.

## Inference for Stan model: ADA-W09-Ex3a-8schools-centered.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##               mean se_mean   sd   2.5%    25%    50%    75% 97.5% n_eff Rhat
## mu            6.43    0.15 4.12  -1.72   3.77   6.24   9.24 14.59   760 1.01
## sigma_prime   5.38    0.25 3.93   0.93   2.43   4.39   7.27 15.36   245 1.02
## theta[1]      9.55    0.28 7.36  -2.99   4.66   8.34  13.36 26.92   697 1.01
## theta[2]      6.75    0.15 5.39  -4.07   3.65   6.42  10.13 17.62  1210 1.00
## theta[3]      5.36    0.19 6.68  -9.78   2.11   5.75   9.25 18.01  1259 1.00
## theta[4]      6.53    0.15 5.76  -5.17   3.24   6.17   9.98 18.57  1470 1.00
## theta[5]      4.54    0.17 5.59  -7.72   1.25   4.92   8.15 14.61  1135 1.00
## theta[6]      5.32    0.16 5.92  -7.77   2.13   5.36   8.94 16.85  1358 1.00
## theta[7]      9.06    0.20 6.17  -1.88   4.92   8.27  12.58 22.87   940 1.01
## theta[8]      7.00    0.19 6.83  -6.52   3.00   6.71  10.68 21.32  1292 1.00
## lp__        -16.88    0.52 5.76 -27.46 -20.99 -17.35 -12.95 -4.31   122 1.04
## 
## Samples were drawn using NUTS(diag_e) at Thu Jun 18 07:18:23 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Explain in your own intuitive terms why these divergent transitions occur. E.g., you might want to say something like: “Since the step size parameter is …, we see divergencies … because the more … this variable is, the more/less … that variable …”

Solution:

Divergent transitions can occur when Stan’s step size (the discretization step of the numerical integrator) is tuned globally, so that a single step length has to fit the whole posterior landscape. When some region of the posterior is very sharply curved, this global step size is too large there, and the simulated trajectory diverges from the true one: the sampler cannot capture the posterior’s shape in that region. In this specific 8-schools example, in areas of the posterior where a step in \(\theta_i\) corresponds to little change in \(\sigma'\), larger steps are unproblematic; but in regions where a small step in \(\theta_i\) corresponds to a large change in \(\sigma'\) (the narrow ‘neck’ of the funnel), the same step size is too large to follow the curvature appropriately, resulting in divergent iterations at those locations.

3.c Non-centered parameterization

An alternative model, with a so-called non-centered parameterization, does not have this problem with divergent transitions (although they can still occur occasionally).

This non-centered model can be written like so:

\[ \begin{align*} y_i & \sim \mathcal{N}(\theta_i, \sigma_i) \\ \theta_i & = \mu + \sigma' \eta_i \\ \eta_i & \sim \mathcal{N}(0, 1) \\ \mu & \sim \mathcal{N}(0, 10) \\ \sigma' & \sim \text{half-Cauchy}(0, 10) \\ \end{align*} \]
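Why this defines the same model: if \(\eta_i \sim \mathcal{N}(0,1)\), then \(\mu + \sigma' \eta_i \sim \mathcal{N}(\mu, \sigma')\). A quick numerical sanity check (the values for \(\mu\) and \(\sigma'\) are arbitrary, made up for illustration):

```r
set.seed(123)
mu <- 6; sigma_prime <- 5        # arbitrary illustrative values
eta <- rnorm(1e6)                # eta ~ N(0, 1)
theta <- mu + sigma_prime * eta  # non-centered construction of theta
c(mean(theta), sd(theta))        # close to mu = 6 and sigma' = 5
```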

Implement and run this model in Stan. Report whether you got any divergent transitions, e.g., with the command get_divergent_iterations applied to the stanfit object. Store the results in the variable stan_fit_3c_8schoolsNC.

Solution:

data {
  int<lower=0> N;
  vector[N] y;
  vector<lower=0>[N] sigma;
}
parameters {
  real mu;
  real<lower=0> sigma_prime;
  vector[N] nu; // standardized deviations (eta in the model above)
}
transformed parameters {
  vector[N] theta;
  theta = mu + sigma_prime*nu;
}
model {
  mu ~ normal(0, 10);
  sigma_prime ~ cauchy(0, 10);
  nu ~ normal(0, 1);
  y ~ normal(theta, sigma);
}
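The number of divergent post-warmup iterations (shown below) can then be checked like so, using the function and object name from the exercise instructions:

```r
# Count divergent post-warmup iterations for the non-centered fit
sum(rstan::get_divergent_iterations(stan_fit_3c_8schoolsNC))
```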
## [1] 0

3.d Explain non-centered parameterization

Let’s look at a plot similar to the one we looked at for the model with centered parameterization in 3.b:

## Inference for Stan model: ADA-W09-Ex3c-8schoolsNC.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##              mean se_mean   sd   2.5%   25%   50%   75% 97.5% n_eff Rhat
## mu           6.53    0.08 4.20  -1.80  3.72  6.50  9.40 14.57  2960    1
## sigma_prime  4.65    0.07 3.78   0.20  1.73  3.85  6.55 14.22  2604    1
## nu[1]        0.35    0.02 0.98  -1.59 -0.32  0.36  1.04  2.19  3575    1
## nu[2]        0.05    0.01 0.89  -1.70 -0.54  0.07  0.64  1.81  4299    1
## nu[3]       -0.16    0.01 0.97  -2.01 -0.81 -0.17  0.49  1.81  4400    1
## nu[4]        0.00    0.01 0.92  -1.78 -0.62  0.00  0.61  1.82  3845    1
## nu[5]       -0.27    0.01 0.90  -1.95 -0.86 -0.29  0.33  1.56  4658    1
## nu[6]       -0.13    0.02 0.93  -1.89 -0.74 -0.15  0.49  1.69  3565    1
## nu[7]        0.34    0.01 0.92  -1.54 -0.27  0.35  0.96  2.13  3772    1
## nu[8]        0.08    0.02 0.93  -1.75 -0.57  0.10  0.69  1.94  3779    1
## theta[1]     8.98    0.11 6.67  -2.09  4.65  8.28 12.29 24.04  3673    1
## theta[2]     6.86    0.09 5.55  -4.23  3.37  6.69 10.22 18.41  4030    1
## theta[3]     5.54    0.10 6.41  -8.66  2.04  5.86  9.46 17.43  4094    1
## theta[4]     6.56    0.09 5.71  -4.87  2.92  6.49 10.08 18.09  3932    1
## theta[5]     4.95    0.08 5.46  -6.83  1.71  5.23  8.66 14.97  4147    1
## theta[6]     5.66    0.09 5.75  -6.99  2.30  5.84  9.36 16.43  4081    1
## theta[7]     8.68    0.09 5.82  -1.78  4.78  8.32 12.13 21.54  4079    1
## theta[8]     7.04    0.11 6.39  -5.63  3.19  6.84 10.69 20.40  3458    1
## lp__        -5.85    0.07 2.38 -11.26 -7.25 -5.56 -4.08 -2.01  1301    1
## 
## Samples were drawn using NUTS(diag_e) at Thu Jun 18 07:19:23 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

What is the main striking difference (apart from the presence/absence of divergent transitions)? How is this difference a reason for why divergent transitions can be problematic? Is any estimated posterior mean for any parameter noticeably affected by this?

Solution:

The most striking difference is that in the non-centered model the samples of \(\sigma'\) plotted against \(\theta_1\) spread much further down into the region of small \(\sigma'\); correspondingly, the estimated posterior for \(\sigma'\) is smaller than in the centered model. With so few data points, the posterior has a pronounced funnel shape near \(\sigma' \approx 0\): in the narrow ‘neck’ of the funnel (rather than in the tails), the surface is so sharply curved that the centered sampler’s step size is too large to navigate it. Divergent transitions therefore occur precisely when small values of \(\sigma'\) are being sampled, which means the sampler systematically fails to explore that region and the samples are biased away from small \(\sigma'\). This is exactly why divergent transitions are problematic: they can distort posterior estimates. Indeed, the posterior mean of \(\sigma'\) is noticeably affected: the centered model reports 5.38, the non-centered model 4.65, so the centered model leads us to conclude that \(\sigma'\) is larger than it really is.

4 [BONUS]: Finite mixtures with \(K>2\) groups

Let’s generalize the finite mixture model of exercise 2.b to be able to deal with more than 2 mixture components. Let’s do this for the following flower data:

4.a Implement the generalized mixture model

Pick up the code from the Stan User’s Guide on generalized finite mixtures. Change the following things:

  • include a prior on theta, namely a Dirichlet prior with the weights as given in data_flower_heights_4 above
  • change the priors for mu to a normal distribution with mean 35 and SD 40
  • change the priors for sigma to a log-normal distribution with mean 3 and spread 1
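A sketch of what this could look like, adapted from the Stan User’s Guide code for finite mixtures with the changes above (the data block names K, N, y and alpha are assumptions; alpha would carry the Dirichlet weights supplied in data_flower_heights_4):

```stan
data {
  int<lower=1> K;            // number of mixture components
  int<lower=1> N;            // number of observations
  vector[N] y;               // flower heights
  vector<lower=0>[K] alpha;  // Dirichlet weights (assumed passed in as data)
}
parameters {
  simplex[K] theta;          // mixing proportions
  ordered[K] mu;             // component means (ordered to break label switching)
  vector<lower=0>[K] sigma;  // component standard deviations
}
model {
  theta ~ dirichlet(alpha);      // prior on mixing weights
  mu ~ normal(35, 40);           // prior on component means
  sigma ~ lognormal(3, 1);       // prior on component SDs
  for (n in 1:N) {
    vector[K] lps = log(theta);  // marginalize out the component indicator
    for (k in 1:K)
      lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    target += log_sum_exp(lps);
  }
}
```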

Solution:

Your answer here.

4.b Make HMC converge by helpful init

If you run the model you might find that this model is slow to converge. Try specifying init values that make the model converge. This can be as blunt as you want, just make it converge (if you can). After this, comment on how useful or legit it is, in your opinion, to tweak the initialization of parameters in order to enable convergence and/or ensure a quicker, more efficient fitting with HMC.
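As one illustration, initial values can be supplied via the init argument of rstan::sampling (or rstan::stan); the particular numbers below, and the assumption of \(K=3\) components, are made-up guesses for the flower-height data:

```r
# Hypothetical init function: start each chain near plausible component means
# (the values and K = 3 are assumptions for illustration only)
init_fun <- function() {
  list(mu    = c(20, 35, 50),
       sigma = rep(5, 3),
       theta = rep(1 / 3, 3))
}
# fit <- rstan::sampling(compiled_model, data = stan_data_4, init = init_fun)
```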

Solution:

Your answer here.